This report looks is an update to the analysis shown on 1/14/2022. Most steps are the same with tweaking due to different data and outlier procedure
at exploring the relationship between wastewater and cases. There are four components to this analysis.
Removing putative outliers
Binning analysis
Smoothing signal
Statistical analysis
This report does not present any final answers but presents some very convincing heuristics.
“data Used from DSIWastewater package”
The two data sets used in this analysis are the Madison case data sourced from the Wisconsin DHS and wastewater concentration data produced by the Wisconsin State Laboratory of Hygiene. This wastewater data has entries every couple of days from 15 September 2020 to 19 April 2022.
| Date | Site | Cases | SevenDayMACases | sars_cov2_adj_load |
|---|---|---|---|---|
| 2020-09-15 | Madison | 65 | 191.00000 | 5.3527971 |
| 2020-09-19 | Madison | 129 | 142.14286 | 1.2473200 |
| 2020-09-22 | Madison | 65 | 120.00000 | 1.9444380 |
| 2020-09-23 | Madison | 122 | 106.28571 | 1.6489902 |
| 2020-09-24 | Madison | 90 | 94.85714 | 0.9406148 |
| 2020-09-25 | Madison | 92 | 87.42857 | 0.6352719 |
## min(Date) max(Date)
## 1 2020-09-15 2022-04-19
The case data has a strong weekend effect so for this section we look at a seven day smoothing of cases. The simple display of the data shows the core components of this story. First, wastewater data is noisy. And that there is a clear relationship between the two signals.
FirstImpressionDF <- FullDF%>%
filter(!is.na(sars_cov2_adj_load),
!is.na(Cases))#Removing NA
MinMaxNorm <- function(Vec, NormTerm, Base = NULL){
if(is.null(Base)){
Base <- Vec
}
maxBase = quantile(Base, .75, na.rm = TRUE)
minBase = quantile(Base, .25, na.rm = TRUE)
maxTerm = quantile(NormTerm, .75, na.rm = TRUE)
minTerm = quantile(NormTerm, .25, na.rm = TRUE)
Ret <- ((Vec-minBase)/(maxBase - minBase))*(maxTerm - minTerm) + minTerm
}
FirstImpression <- FirstImpressionDF%>%
ggplot(aes(x=Date))+#Data depends on time
geom_point(aes(y= Cases, color="Cases",info=Cases),size = 1)+
geom_line(aes(y=MinMaxNorm(sars_cov2_adj_load, Cases),
color="Wastewater",
info=sars_cov2_adj_load))+#compares sars_cov2_adj_load to Cases
geom_line(aes(y= SevenDayMACases,
color="Seven Day MA Cases",
info=Cases))+
labs(y="Reported cases")+
ColorRule
ggplotly(FirstImpression)%>%
add_lines(x=~Date, y=FirstImpressionDF$sars_cov2_adj_load,
yaxis="y2", data=FirstImpressionDF, showlegend=FALSE, inherit=FALSE) %>%
layout(yaxis2 = SecondAxisFormat,
legend=list(title=list(text=''),x = 1.15, y = 0.9),
margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))
Figure 1.1: Wastewater concentration and daily Covid-19 case data for Madison. A seven day moving average of cases is used to reduce a day of the week effect.
#To remove weekend effects we are looking at the 7 day smoothing of cases.
Looking at the wastewater measurements we observe there were some points many times larger than adjacent values hinting at them being outliers. We used the adjacent 10 values on each side and marked points 2.5 standard deviations away from the group mean as outliers.
#default pass to IdentifyOutliers
#method="SD", align="center", n = 5, Bin = 21, Action = "Flag"
source("OutlierDetectionFuncs.R")
ErrorMarkedDF <- FullDF%>%#
mutate(FlagedOutliers = IdentifyOutliers(sars_cov2_adj_load, Action = "Flag"),
#Manual flagging that method misses due to boundary effect with binning
NoOutlierVar = ifelse(FlagedOutliers, NA, sars_cov2_adj_load))
#Split N1 into outlier and non outlier for next ggplot
OutLierPlotDF <- ErrorMarkedDF%>%#outlier_sars_cov2_adj_load
mutate(outlier_sars_cov2_adj_load = ifelse(FlagedOutliers, sars_cov2_adj_load,NA))%>%
mutate(sars_cov2_adj_load = NoOutlierVar)
OutLierPlotObject <- OutLierPlotDF%>%
filter(!(is.na(sars_cov2_adj_load)&is.na(outlier_sars_cov2_adj_load)))%>%
ggplot(aes(x=Date))+#Data depends on time
geom_line(aes(y=sars_cov2_adj_load,
color="sars_cov2_adj_load",
info = sars_cov2_adj_load))+#compares Var to Cases
geom_point(aes(y = outlier_sars_cov2_adj_load,
color= "outlier_sars_cov2_adj_load",
info = outlier_sars_cov2_adj_load))#+
#ColorRule
#mentioned hand picked list other choices
ggplotly(OutLierPlotObject,tooltip=c("info","Date"))%>%
layout(yaxis = SecondAxisFormat,
legend=list(title=list(text=''),x = 1.15, y = 0.9),
margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))
Figure 2.1: Wastewater concentration for Madison with potential outliers marked. Using a rolling symmetrical bin of 21 days as a sample we use 2.5 standard deviations of the bin as a metric to reject extreme points. This process is ran multiple times to get a robust process to select outliers.
#Drop Var create Var filter
UpdatedDF <- ErrorMarkedDF%>%
select(-sars_cov2_adj_load)%>%
rename(sars_cov2_adj_load = NoOutlierVar)
The goal in this section is to smooth the data to get a similar effect without losing resolution.
A key component to this is that the relationship between sars_cov2_adj_load and Case involves a gamma distribution modeling both the time between catching Covid-19 and getting a test and the concentration of the shedded particles. We found a gamma distribution with mean 11.73 days and a standard deviation of 7.68 gives good results and matches other research (Fernandez-Cassi et al. 2021).
Mean <- 11.73
StandardDeviation <- 7.68
Scale = StandardDeviation^2/Mean
Shape = Mean/Scale
SLDWidth <- 21
weights <- dgamma(1:SLDWidth, scale = Scale, shape = Shape)
par(mar=c(4,4,4,10))
plot(weights,
main=paste("Gamma Distribution with mean =",Mean, "days,and SD =",StandardDeviation),
ylab = "Weight",
xlab = "Lag")
Figure 3.1: gamma distribution used for shedding lag distribution
SLDSmoothedDF <- UpdatedDF%>%
mutate(
SLDCases = c(rep(NA,SLDWidth-1), #elimination of starting values not relevant
#as we have a 50+ day buffer of case data
rollapply(Cases,width=SLDWidth,FUN=weighted.mean,
w=weights,
na.rm = FALSE)
#,rep(NA,10)
))#no missing data to remove
SLDPlot = SLDSmoothedDF%>%
filter(!is.na(SLDCases))%>%
ggplot(aes(x=Date))+
geom_line(aes(y=Cases,
color="Cases" , info = Cases),alpha=.2)+
geom_line(aes(y=SevenDayMACases,
color="Seven Day MA Cases" , info = SevenDayMACases),alpha=.4)+
geom_line(aes(y=SLDCases, color="SLD Cases",info = SLDCases))+
labs(y="Reported Cases")+
ColorRule
ggplotly(SLDPlot,tooltip=c("info","Date"))%>%
layout(legend=list(title=list(text=''),x = 1.15, y = 0.9),
margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))
Figure 3.2: Madison Case data for Madison. SLD Cases is a weighted mean of cases using the gamma distribution as the weight distribution.
Cross correlation and Granger Causality are key components to formalize this analysis. Cross correlation looks at the correlation at a range of time shifts and Granger analysis performs a test for predictive power.
CCFChar <- function(ccfObject){
LargestC = max(ccfObject$acf)
Lag = which.max(ccfObject$acf)-21
return(c(LargestC,Lag))
}
ModelTesting <- function(DF,Var1,Var2){
#removing rows from before both series started
UsedDF <- DF%>%
filter(!is.na(Var1),
!is.na(Var2))
Vec1 <- unname(unlist(UsedDF[Var1]))
Vec2 <- unname(unlist(UsedDF[Var2]))
CCFReport <- CCFChar(ccf(Vec1,Vec2,na.action=na.pass,plot = FALSE))
VarPredCase <- grangertest(Vec1, Vec2, order = 1)$"Pr(>F)"[2]
CasePredVar <- grangertest(Vec2,Vec1, order = 1)$"Pr(>F)"[2]
return(round(c(CCFReport,CasePredVar,VarPredCase),4))
}
#ErrorRemovedDF
BaseLine <- ModelTesting(FullDF, "sars_cov2_adj_load","Cases")
BaseLineSevenDay <- ModelTesting(FullDF, "sars_cov2_adj_load","SevenDayMACases")
ErrorRemoved <- ModelTesting(UpdatedDF, "sars_cov2_adj_load","SevenDayMACases")
SLDVar <- ModelTesting(SLDSmoothedDF, "sars_cov2_adj_load","SLDCases")
SevenLoess <- ModelTesting(SLDSmoothedDF, "loess_sars_cov2_adj_load","SevenDayMACases")
SLDLoess <- ModelTesting(SLDSmoothedDF, "loess_sars_cov2_adj_load","SLDCases")
Output <- data.frame(row.names=c("Max Cross Correlation",
"Lag of largest Cross correlation",
"P-value Wastewater predicts Cases",
"P-value Cases predicts wastewater"),
CasesvsVar = BaseLine,
SevenDayMACasesvsVar = BaseLineSevenDay,
ErrorRemoved = ErrorRemoved,
SLDVar = SLDVar,
SevenLoess = SevenLoess,
SLDLoess = SLDLoess)
OutputRightPosition <- data.frame(t(Output))
colnames(OutputRightPosition) <- rownames(Output)
rownames(OutputRightPosition) <- c(paste("Section 1: Cases vs sars_cov2_adj_load"),
paste("Section 1: 7 Day MA Cases vs sars_cov2_adj_load"),
paste("Section 2: Cases vs sars_cov2_adj_load"),
paste(" Section 4.2: SLD Cases vs sars_cov2_adj_load"),
paste("Section 4.3: 7 Day MA Cases vs Loess smoothing of sars_cov2_adj_load"),
paste("Section 4.3: SLD Cases vs Loess smoothing of sars_cov2_adj_load"))
formattable(OutputRightPosition)
| Max Cross Correlation | Lag of largest Cross correlation | P-value Wastewater predicts Cases | P-value Cases predicts wastewater | |
|---|---|---|---|---|
| Section 1: Cases vs sars_cov2_adj_load | 0.6602 | 23 | 0.1908 | 0.0000 |
| Section 1: 7 Day MA Cases vs sars_cov2_adj_load | 0.5251 | -11 | 0.0100 | 0.0017 |
| Section 2: Cases vs sars_cov2_adj_load | 0.7041 | -11 | 0.2413 | 0.9605 |
| Section 4.2: SLD Cases vs sars_cov2_adj_load | 0.6832 | -20 | 0.4858 | 0.0013 |
| Section 4.3: 7 Day MA Cases vs Loess smoothing of sars_cov2_adj_load | 0.6753 | -6 | 0.0000 | 0.0696 |
| Section 4.3: SLD Cases vs Loess smoothing of sars_cov2_adj_load | 0.7555 | -17 | 0.0000 | 0.0000 |